FIFA 20: explain default vs tuned model with dalex

imports

In [1]:
import dalex as dx # version 0.1.5

import numpy as np
import pandas as pd

from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings('ignore')

load data

Load fifa, the preprocessed players_20 dataset. It contains the 5000 best players according to the overall rating, and 43 columns. These are:

  • short_name (index)
  • nationality of the player (not used in modeling)
  • overall, potential, value_eur, wage_eur (4 potential target variables)
  • age, height, weight, attacking skills, defending skills, goalkeeping skills (37 variables)

It is advised to keep only one of these as the target variable for modeling.

In [2]:
data = dx.datasets.load_fifa()
In [3]:
data.head(10)
Out[3]:
| short_name | age | height_cm | weight_kg | nationality | overall | potential | value_eur | wage_eur | attacking_crossing | attacking_finishing | ... | mentality_penalties | mentality_composure | defending_marking | defending_standing_tackle | defending_sliding_tackle | goalkeeping_diving | goalkeeping_handling | goalkeeping_kicking | goalkeeping_positioning | goalkeeping_reflexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L. Messi | 32 | 170 | 72 | Argentina | 94 | 94 | 95500000 | 565000 | 88 | 95 | ... | 75 | 96 | 33 | 37 | 26 | 6 | 11 | 15 | 14 | 8 |
| Cristiano Ronaldo | 34 | 187 | 83 | Portugal | 93 | 93 | 58500000 | 405000 | 84 | 94 | ... | 85 | 95 | 28 | 32 | 24 | 7 | 11 | 15 | 14 | 11 |
| Neymar Jr | 27 | 175 | 68 | Brazil | 92 | 92 | 105500000 | 290000 | 87 | 87 | ... | 90 | 94 | 27 | 26 | 29 | 9 | 9 | 15 | 15 | 11 |
| J. Oblak | 26 | 188 | 87 | Slovenia | 91 | 93 | 77500000 | 125000 | 13 | 11 | ... | 11 | 68 | 27 | 12 | 18 | 87 | 92 | 78 | 90 | 89 |
| E. Hazard | 28 | 175 | 74 | Belgium | 91 | 91 | 90000000 | 470000 | 81 | 84 | ... | 88 | 91 | 34 | 27 | 22 | 11 | 12 | 6 | 8 | 8 |
| K. De Bruyne | 28 | 181 | 70 | Belgium | 91 | 91 | 90000000 | 370000 | 93 | 82 | ... | 79 | 91 | 68 | 58 | 51 | 15 | 13 | 5 | 10 | 13 |
| M. ter Stegen | 27 | 187 | 85 | Germany | 90 | 93 | 67500000 | 250000 | 18 | 14 | ... | 25 | 70 | 25 | 13 | 10 | 88 | 85 | 88 | 88 | 90 |
| V. van Dijk | 27 | 193 | 92 | Netherlands | 90 | 91 | 78000000 | 200000 | 53 | 52 | ... | 62 | 89 | 91 | 92 | 85 | 13 | 10 | 13 | 11 | 11 |
| L. Modrić | 33 | 172 | 66 | Croatia | 90 | 90 | 45000000 | 340000 | 86 | 72 | ... | 82 | 92 | 68 | 76 | 71 | 13 | 9 | 7 | 14 | 9 |
| M. Salah | 27 | 175 | 71 | Egypt | 90 | 90 | 80500000 | 240000 | 79 | 90 | ... | 77 | 91 | 38 | 43 | 41 | 14 | 14 | 9 | 11 | 14 |

10 rows × 42 columns

Divide the data into explanatory variables X and a target variable y. Here we will predict the value of the best players.

In [4]:
X = data.drop(["nationality", "overall", "potential", "value_eur", "wage_eur"], axis = 1)
y = data['value_eur']

The target variable is right-skewed, so we log-transform it for a better fit.

In [5]:
ylog = np.log(y)

import matplotlib.pyplot as plt
plt.hist(ylog, bins='auto')
plt.title("ln(value_eur)")
plt.show()

Split the data into train and test.

In [6]:
# the three arrays are split consistently, using the same shuffled indices
X_train, X_test, ylog_train, ylog_test, y_train, y_test = train_test_split(X, ylog, y, test_size=0.25, random_state=4)

create a default boosting model

In [7]:
gbm_default = LGBMRegressor()

gbm_default.fit(X_train, ylog_train, verbose = False)
Out[7]:
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
              random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

create a tuned model

In [8]:
#:# hp tuning
estimator = LGBMRegressor(n_jobs = -1)
param_test = {
    'n_estimators': list(range(201,1202,50)),
    'num_leaves': list(range(6, 42, 5)),
    'min_child_weight': [1e-3, 1e-2, 1e-1, 15e-2],
    'learning_rate': [1e-3, 1e-2, 1e-1, 15e-2]
}

rs = RandomizedSearchCV(
    estimator=estimator, 
    param_distributions=param_test, 
    n_iter=100,
    cv=4,
    random_state=1
)

# the search takes a while, so it is commented out; the best parameters found are hard-coded below
# rs.fit(X, ylog)
# print('Best score reached: {} with params: {} '.format(rs.best_score_, rs.best_params_))
In [9]:
#:# best parameters after 100 iterations
best_params = {'num_leaves': 6, 'n_estimators': 951, 'min_child_weight': 0.1, 'learning_rate': 0.15}
In [10]:
gbm_tuned = LGBMRegressor(**best_params)
gbm_tuned.fit(X_train, ylog_train)
Out[10]:
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.15, max_depth=-1,
              min_child_samples=20, min_child_weight=0.1, min_split_gain=0.0,
              n_estimators=951, n_jobs=-1, num_leaves=6, objective=None,
              random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

create explainers for the models

We want the explanations to show the target variable on its original scale (not the log scale). Therefore, we need a custom predict_function that inverts the transform.

In [11]:
def predict_function(model, data):
    # invert the log transform so that predictions are in EUR
    return np.exp(model.predict(data))
In [12]:
exp_default = dx.Explainer(gbm_default, X_test, y_test, predict_function=predict_function, label='default')
exp_tuned = dx.Explainer(gbm_tuned, X_test, y_test, predict_function=predict_function, label='tuned')
Preparation of a new explainer is initiated

  -> label             : default
  -> data              : 1250 rows 37 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1250 values
  -> predict function  : <function predict_function at 0x000002A067A3EEE8> will be used
  -> predicted values  : min = 344216.4120467619, mean = 7005626.232746631, max = 74392447.67353645
  -> residual function : difference between y and yhat
  -> residuals         : min = -16303511.677979246, mean = 443009.76725336973, max = 31906890.977342248
  -> model_info        : package lightgbm

A new explainer has been created!
Preparation of a new explainer is initiated

  -> label             : tuned
  -> data              : 1250 rows 37 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1250 values
  -> predict function  : <function predict_function at 0x000002A067A3EEE8> will be used
  -> predicted values  : min = 299285.6321924611, mean = 7161084.07190136, max = 102801045.09887527
  -> residual function : difference between y and yhat
  -> residuals         : min = -13910737.11250035, mean = 287551.9280986407, max = 26885931.631276377
  -> model_info        : package lightgbm

A new explainer has been created!

dalex functions

[figure: overview of the dalex explanation functions]

The functions above are accessible from the Explainer object through its methods.

Each of them returns a new unique object that contains a result field in the form of a pandas.DataFrame and a plot method.

In [13]:
mp_default = exp_default.model_performance("regression")
mp_default.result
Out[13]:
|   | mse | rmse | r2 | mae | mad |
|---|-----|------|----|-----|-----|
| 0 | 7.813017e+12 | 2.795177e+06 | 0.905496 | 1.293972e+06 | 640241.91959 |
In [14]:
mp_tuned = exp_tuned.model_performance("regression")
mp_tuned.result
Out[14]:
|   | mse | rmse | r2 | mae | mad |
|---|-----|------|----|-----|-----|
| 0 | 5.258783e+12 | 2.293204e+06 | 0.936391 | 1.106670e+06 | 547745.051113 |
In [15]:
mp_default.plot(mp_tuned)

These are very large values, so the difference may look subtle on paper.

What are the differences between these two models? Let's find out.

Variable Importance

Customize the computation with these model_parts parameters (a customized call is sketched after the list):

  • loss_function: the function to use for drop-out loss evaluation

  • B: the number of bootstrap rounds (e.g. 15 for slower computation but more stable results)

  • N: the number of observations to use (e.g. 1000 for faster computation but less stable results)

  • variable_groups: a dict of lists of variables; each list is treated as one group, which allows testing joint variable importance
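
For example, a customized call might look like this (a sketch; rmse_loss is a hypothetical helper, and dalex also accepts loss names such as 'rmse'):

def rmse_loss(observed, predicted):
    # hypothetical custom drop-out loss: RMSE between observed and predicted values
    return np.sqrt(np.mean((observed - predicted) ** 2))

vi_custom = exp_tuned.model_parts(
    loss_function=rmse_loss,  # a callable or a name like 'rmse'
    B=15,                     # more bootstrap rounds for more stable results
    N=1000                    # fewer observations for faster computation
)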

In [16]:
X.columns
Out[16]:
Index(['age', 'height_cm', 'weight_kg', 'attacking_crossing',
       'attacking_finishing', 'attacking_heading_accuracy',
       'attacking_short_passing', 'attacking_volleys', 'skill_dribbling',
       'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
       'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed',
       'movement_agility', 'movement_reactions', 'movement_balance',
       'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
       'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
       'mentality_positioning', 'mentality_vision', 'mentality_penalties',
       'mentality_composure', 'defending_marking', 'defending_standing_tackle',
       'defending_sliding_tackle', 'goalkeeping_diving',
       'goalkeeping_handling', 'goalkeeping_kicking',
       'goalkeeping_positioning', 'goalkeeping_reflexes'],
      dtype='object')
In [17]:
variable_groups = {
    'age': ['age'],
    'body': ['height_cm', 'weight_kg'],
    'attacking': ['attacking_crossing',
       'attacking_finishing', 'attacking_heading_accuracy',
       'attacking_short_passing', 'attacking_volleys'],
    'skill': ['skill_dribbling',
       'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
       'skill_ball_control'],
    'movement': ['movement_acceleration', 'movement_sprint_speed',
       'movement_agility', 'movement_reactions', 'movement_balance'],
    'power': ['power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
       'power_long_shots'],
    'mentality': ['mentality_aggression', 'mentality_interceptions',
       'mentality_positioning', 'mentality_vision', 'mentality_penalties',
       'mentality_composure'],
    'defending': ['defending_marking', 'defending_standing_tackle',
       'defending_sliding_tackle'],
    'goalkeeping' : ['goalkeeping_diving',
       'goalkeeping_handling', 'goalkeeping_kicking',
       'goalkeeping_positioning', 'goalkeeping_reflexes']
}
In [18]:
vi_default = exp_default.model_parts(variable_groups=variable_groups, B=15)
vi_tuned = exp_tuned.model_parts(variable_groups=variable_groups, B=15)

Customize the plot with parameters:

  • vertical_spacing: a value between 0.0 and 1.0 (e.g. 0.15 for more space between the plots)

  • rounding_function: the function used to round the contributions (e.g. np.round, np.rint, np.ceil)

  • digits: the number of decimal places for the rounding function (e.g. 2 for np.round, None for np.rint)

In [19]:
vi_default.plot(vi_tuned, max_vars=6, rounding_function=np.rint, digits=None, vertical_spacing=0.15)

Variables connected with body and power aren't important for these models, and the same holds for goalkeeping. This might mean that the predictions for goalkeepers aren't accurate. The most important factors in predicting a player's value are the skill, attacking and movement variables.

It seems that the default model focuses too much on the movement variables and underrates the others, especially skill. The tuned model also finds mentality and defending quite important. Next, we will examine these variables more closely.

Aggregated Profiles

Choose a proper algorithm. The explanations can be calculated as Partial Dependence Profiles or Accumulated Local Dependence Profiles.

The key parameter is N, the number of observations to use (e.g. 800 for slower computation but more stable results).

Here we will use ALE plots, which work better when the explanatory variables are correlated.
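
For comparison, a partial dependence profile could be computed by switching the type (a sketch, not run here):

pdp_default = exp_default.model_profile(type='partial', N=800)  # PDP instead of ALE
pdp_default.result['_label_'] = 'pdp default'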

In [20]:
ale_default = exp_default.model_profile(type = 'accumulated', N=800)
ale_default.result['_label_'] = 'ale default'
Calculating ceteris paribus!: 100%|██████████| 37/37 [00:12<00:00,  2.86it/s]
Calculating accumulated dependency!: 100%|██████████| 37/37 [00:13<00:00,  2.74it/s]
In [21]:
ale_tuned = exp_tuned.model_profile(type = 'accumulated', N=800)
ale_tuned.result['_label_'] = 'ale tuned'
Calculating ceteris paribus!: 100%|██████████| 37/37 [00:32<00:00,  1.14it/s]
Calculating accumulated dependency!: 100%|██████████| 37/37 [00:13<00:00,  2.84it/s]
In [30]:
ale_default.plot(ale_tuned, variables = ['goalkeeping_positioning', 'power_stamina',
                                           'mentality_vision', 'defending_marking',
                                           'attacking_finishing', 'attacking_heading_accuracy',
                                           'attacking_short_passing', 'skill_ball_control'])

Overall, we can see that the tuned model uses more variables, e.g. defending_marking, goalkeeping_positioning, mentality_vision, power_stamina, skill_ball_control and the attacking variables.

It also behaves differently for variables like age and movement_reactions.

In [23]:
ale_default.plot(ale_tuned, variables = ['age', 'movement_reactions'])

Variable Attribution

Choose a proper algorithm. The explanations can be calculated as Break Down, iBreakDown or Shapley Values.

For type='shap' the key parameter is B, the number of bootstrap rounds (e.g. 10 for faster computation but less stable results). A plain Break Down call is sketched below.
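
For reference, an additive Break Down attribution without interactions would use type='break_down' (a sketch):

bd = exp_tuned.predict_parts(X.iloc[0], type='break_down')  # additive attributions, no interactions
bd.plot()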

Let's find out what contributes to the value of the best players.

In [24]:
va = {'ibd': [], 'sh': []}

# explain the three best players
for name in data.index[0:3]:
    player = X.loc[name]

    # iBreakDown: Break Down with interactions
    ibd = exp_tuned.predict_parts(player, type='break_down_interactions')
    ibd.result.label = name

    # Shapley values with 10 bootstrap rounds
    sh = exp_tuned.predict_parts(player, type='shap', B=10)
    sh.result.label = name

    va['ibd'].append(ibd)
    va['sh'].append(sh)
In [25]:
va['ibd'][0].plot(va['ibd'][1:3], rounding_function=np.rint, digits=None, max_vars=10)
In [26]:
va['sh'][0].plot(va['sh'][1:3], rounding_function=np.rint, digits=None, max_vars=10)

Looking at the Break Down plots, the age and movement_reactions variables stand out. Let's focus on them more.

In [41]:
cp = exp_tuned.predict_profile(X.iloc[2:3], variables=['age', 'movement_reactions']) # profiles for Neymar Jr, only for the selected variables
Calculating ceteris paribus!: 100%|██████████| 2/2 [00:00<00:00, 80.22it/s]
In [43]:
cp.plot(size=3, title="What If? Neymar Jr") # increase line width and dot size & set a custom title

Here we see how the prediction would change if Neymar Jr were younger or older, or had a lower movement_reactions value.
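
predict_profile also accepts several observations at once, so the same what-if analysis can be done for the three best players (a sketch):

cp_top3 = exp_tuned.predict_profile(X.iloc[0:3], variables=['age', 'movement_reactions'])
cp_top3.plot(size=3)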

Hover over all of the above plots for tooltips with more information.

Plots

This package uses plotly to render the plots, which is what makes them interactive.
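
To customize a chart beyond the built-in parameters, the plot methods accept show=False and then return a plotly Figure object (a sketch; the layout tweak is just an example):

fig = mp_default.plot(mp_tuned, show=False)  # return the Figure instead of displaying it
fig.update_layout(title_text="model performance: default vs tuned")  # any plotly customization
fig.show()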
